Optimal estimation of a coin's bias using noisy data is surprisingly different from the same problem 
with noiseless data. We study this problem using entropy risk to quantify estimators' accuracy. We 
generalize the "add $\beta$" estimators that work well for noiseless coins, and we find that these hedged 
maximum-likelihood (HML) estimators achieve a worst-case risk of $O(N^{-1/2})$ on noisy coins, in 
contrast to $O(N^{-1})$ in the noiseless case. We demonstrate that this increased risk is unavoidable 
and intrinsic to noisy coins, by constructing minimax estimators (numerically). However, minimax 
estimators introduce extreme bias in return for slight improvements in the worst-case risk. So we 
introduce a pointwise lower bound on the minimum achievable risk as an alternative to the minimax 
criterion, and use this bound to show that HML estimators are pretty good. We conclude with 
a survey of scientific applications of the noisy coin model in social science, physical science, and 
quantum information science. 



I. INTRODUCTION 

Coins and dice lie at the foundation of probability the- 
ory, and estimating the probabilities associated with a 
biased coin or die is one of the oldest and most-studied 
problems in statistical inference. But what if the data are 
noisy - e.g., each outcome is randomized with some small 
(but known) probability before it can be recorded? Esti- 
mating the underlying probabilities associated with the 
"noisy coin" or "noisy die" has attracted little attention 
to date. This is odd and unfortunate, given the range of 
scenarios to which it applies. Noisy data appear in social 
science (in the context of randomized response), in parti- 
cle physics (because of background counts), in quantum 
information science (because quantum states can only be 
sampled indirectly), and in other scientific contexts too 
numerous to list. 

Noisy count data of this sort are usually dealt with in
an ad hoc fashion. For instance, if we are estimating the
probability $p$ that a biased coin comes up "heads", and
each observation gets flipped with probability $\alpha$, then
after $N$ flips, the expectation of the number of "heads"
observed ($n$) is not $Np$ but $Nq$, where

\[ q = \alpha + p(1 - 2\alpha). \]

The ad hoc solution is to simply estimate $\hat{q} = n/N$ and
invert $q(p)$ to get a linear inversion estimator

\[ \hat{p}_{\mathrm{LI}} = \frac{\hat{q} - \alpha}{1 - 2\alpha}. \tag{1} \]

Rather awkwardly, though, $\hat{p}_{\mathrm{LI}}$ may be negative!
Constraining it to the interval $[0, 1]$ fixes this problem, but
now $\hat{p}$ is a biased estimator. Whether it is a good
estimator - i.e., accurate and risk-minimizing in some sense
- becomes hopelessly unclear.
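The awkwardness of linear inversion is easy to reproduce. The following is a minimal sketch, not taken from the paper: the function names `linear_inversion` and `clipped_linear_inversion` are illustrative, and the values $N = 100$, $\alpha = 0.1$ are assumed purely for the example.

```python
def linear_inversion(n, N, alpha):
    """Invert q = alpha + p*(1 - 2*alpha) at the observed frequency n/N."""
    q_hat = n / N
    return (q_hat - alpha) / (1 - 2 * alpha)


def clipped_linear_inversion(n, N, alpha):
    """Same estimate, constrained to the interval [0, 1]."""
    return min(1.0, max(0.0, linear_inversion(n, N, alpha)))


# Assumed example values: N = 100 flips, noise alpha = 0.1.
# Observing fewer than alpha*N = 10 heads gives a negative raw estimate.
raw = linear_inversion(5, 100, 0.1)              # (0.05 - 0.1) / 0.8
clipped = clipped_linear_inversion(5, 100, 0.1)  # pinned to the boundary
```

Clipping makes the estimate a valid probability, but it pins every data set with $n < \alpha N$ to the same value, which is the source of the bias mentioned above.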

In this paper, we analyze point estimators for noisy
binomial data from the ground up, using entropy risk as
our benchmark of accuracy. Most of our results for coins
extend to noisy dice (multinomial data) as well. We begin
with a review of optimal estimators for noiseless coins,
then examine the differences between noisy and noiseless
data from a statistically rigorous perspective. We show
how to generalize good noiseless estimators to noisy data,
compare their performance to (numerical) minimax
estimators for noisy coins, point out the shortcomings in the
minimax approach, and propose an alternative estimator
with good performance across the board.



II. STARTING POINTS: LIKELIHOOD AND 
RISK 

The noisy coin is a classic point estimation problem:
we observe data $n$, sampled from one of a parametric
family of distributions $\Pr(n|p)$ parameterized by $p$, and
we seek a point estimator $\hat{p}(n)$ that minimizes a risk
function $R(\hat{p}; p)$. The problem is completely defined by (1)
the sampling distribution, and (2) the risk function.



A. The sampling distribution and likelihood 
function 

For a noiseless coin, the sampling distribution is

\[ \Pr(n|p) = \binom{N}{n} p^n (1-p)^{N-n}, \]

where $N$ is the total number of samples, and $p \in [0, 1]$.
Our interest, however, is in the noisy coin, wherein each
of the $N$ observations gets flipped with probability
$\alpha \in [0, \tfrac{1}{2})$. The sampling distribution is therefore

\[ \Pr(n|p) = \binom{N}{n} q^n (1-q)^{N-n}, \]

where the "effective probability" of observing heads is

\[ q(p) = \alpha + p(1 - 2\alpha) = p + \alpha(1 - 2p). \]

The noiseless coin is simply the special case where $\alpha = 0$.
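As a concrete illustration of the noisy sampling distribution, here is a short sketch using only the Python standard library. The helper names `effective_q` and `noisy_pmf` and the example values are assumptions made for illustration.

```python
from math import comb


def effective_q(p, alpha):
    """Effective probability of observing heads, q = alpha + p*(1 - 2*alpha)."""
    return alpha + p * (1 - 2 * alpha)


def noisy_pmf(n, N, p, alpha):
    """Pr(n | p) for N flips, each recorded outcome flipped with probability alpha."""
    q = effective_q(p, alpha)
    return comb(N, n) * q**n * (1 - q)**(N - n)


# Assumed example: a coin that never shows heads (p = 0), observed with
# alpha = 0.25. The recorded counts still behave like Binomial(N, 0.25).
N, p, alpha = 20, 0.0, 0.25
total = sum(noisy_pmf(n, N, p, alpha) for n in range(N + 1))
mean_heads = sum(n * noisy_pmf(n, N, p, alpha) for n in range(N + 1))
```

Even though the true $p$ is zero in this example, the expected number of recorded heads is $Nq = N\alpha$, not zero.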

Everything that the data $n$ imply about the underlying
parameter $p$ (or $q$) is conveyed by the likelihood function,

\[ \mathcal{L}(p) = \Pr(n|p). \]

$\mathcal{L}$ measures the relative plausibility of different
parameter values. Since its absolute value is never relevant, we
ignore constant factors and use

\[ \mathcal{L}(p) = (\alpha + p(1 - 2\alpha))^n (1 - \alpha - p(1 - 2\alpha))^{N-n}. \]

Frequentist and Bayesian analysis differ in how $\mathcal{L}(p)$
should be used. Frequentist estimators, including maximum
likelihood (ML), generally make use only of $\mathcal{L}(p)$.
Bayesian estimators use Bayes' Rule to combine $\mathcal{L}(p)$
with a prior distribution $P_0(p)dp$ and obtain a posterior
distribution,

\[ \Pr(p|n)\,dp = \frac{\mathcal{L}(p) P_0(p)\,dp}{\int \mathcal{L}(p) P_0(p)\,dp}, \]

which determines the estimate.



B. Risk and cost

Since there is a linear relationship between $q$ and $p$,
we could easily reformulate the problem to seek an
estimator $\hat{q}$ of $q \in [\alpha, 1 - \alpha]$. Estimation of $q$ and of $p$ are
equivalent - except for the cost function. If for practical
reasons we care about $q$, then the cost function will
naturally depend on $q$. But if we are fundamentally interested
in $p$, then the cost function will naturally depend on $p$.
This determines whether we should analyze the problem
in terms of $q$ or $p$. It also implies (somewhat
counterintuitively) that optimal estimators of $p$ and $q$ need not share
the linear relationship between the variables themselves.

Entropy risk provides a good illustration. Suppose
that the estimator $\hat{p}(n)$ [or $\hat{q}(n)$] is used to predict the
next event. The cost of event $k$ (e.g. "heads") occurring
depends on the estimated $\hat{p}_k$ and is given by $-\log \hat{p}_k$.
(This "log loss" rule models a variety of concrete
problems, most notably gambling and investment). The
expected cost depends on the true probability and on the
estimate, and equals $C = -\sum_k p_k \log \hat{p}_k$. Some of this cost
is unavoidable, for random events cannot be predicted
perfectly. The minimum cost, achieved when $\hat{p}_k = p_k$,
is $C_{\min} = -\sum_k p_k \log p_k$. The extra cost, which stems
entirely from inaccuracy in $\hat{p}$, is the [entropy] risk:

\[ R(\hat{p}; p) = C - C_{\min} = \sum_k p_k (\log p_k - \log \hat{p}_k). \]

This celebrated quantity, the relative entropy or
Kullback-Leibler divergence [6] from $p$ to $\hat{p}$, is a good
measure of how inaccurately $\hat{p}$ predicts $p$.

Now (returning to the noisy coin), both $q$ and $p$ are
probabilities, so we could minimize the entropy risk from
$q$ to $\hat{q}$ or from $p$ to $\hat{p}$. But in the problems we consider,
the events of genuine interest are the underlying (hidden)
ones - not the ones we get to observe. So $p$ is the
operationally relevant probability, not $q$. This is quite
important, for if $p \approx 0$ then $q \approx \alpha$. Suppose the estimate
is off by $\epsilon \ll \alpha$. The risk of that error depends critically
on whether we compare the $q$'s,

\[ R(\hat{q}; q) \approx R(\alpha + \epsilon; \alpha) \approx \frac{\epsilon^2}{2\alpha(1-\alpha)}, \]

or the $p$'s, in which case $R(\hat{p}; p) = \infty$ if $\hat{p} = 0$ and $p \neq 0$,
or (otherwise),

\[ R(\hat{p}; p) \approx R(\epsilon; 0) \approx \epsilon. \]

Entropy risk has special behavior near the state-set
boundary - i.e., when $p \approx 0$ - which has a powerful effect
on estimator performance.



III. THE NOISELESS COIN



The standard biased coin, with bias $p$ = Pr("heads"),
is a mainstay of probability and statistics. It provides
excellent simple examples of ML and Bayesian estimation.

A. The basics 

When a coin is flipped $N$ times and $n$ "heads" are
observed, the likelihood function is

\[ \mathcal{L}(p) = p^n (1-p)^{N-n}, \]

and its maximum is achieved at

\[ \hat{p}_{\mathrm{ML}} = \frac{n}{N}. \]

So maximum likelihood estimation agrees with the
obvious naive method of linear inversion (Eq. 1), which
equates probabilities with observed frequencies. Risk
plays no role in deriving the ML estimator, and one
objection to ML is that since $\Pr(n = 0|p)$ is nonzero for
$0 < p < 1$, it is quite possible to assign $\hat{p}_{\mathrm{ML}} = 0$ when $p \neq 0$,
which results in infinite entropy risk. The practical
problem here is that $\hat{p} = 0$ may be interpreted as a willingness
to bet at infinite odds against "heads" coming up - which
is more or less obviously a bad idea.

The simplest Bayesian estimator is Laplace's Law [12].
It results from choosing a Lebesgue ("flat") prior
$P_0(p)dp = dp$, and then reporting the mean of the
posterior distribution:

\[ \hat{p} = \int p \Pr(p|n)\,dp
          = \frac{\int p\,\mathcal{L}(p) P_0(p)\,dp}{\int \mathcal{L}(p) P_0(p)\,dp}
          = \frac{\int p^{n+1} (1-p)^{N-n}\,dp}{\int p^n (1-p)^{N-n}\,dp}
          = \frac{n + 1}{N + 2}. \]

This estimator is also known as the "add 1" rule, for
it is equivalent to (i) adding 1 fictitious observation of
each possibility (heads and tails, in this case), then (ii)
applying linear inversion or ML.

The derivation of Laplace's Law poses two obvious 
questions. Why report the posterior mean? and why 
use the Lebesgue prior dp? The first has a good answer, 
while the second does not. 

We report the posterior mean because it minimizes the
expected entropy risk - i.e., it is the Bayes estimator [11]
for $P_0(p)dp$. To prove this, note that in a Bayesian
framework, we have assumed that $p$ is in fact a random variable
distributed according to $P_0(p)dp$, and therefore given the
data $n$, $p$ is distributed according to $\Pr(p|n)$. A simple
calculation shows that the expected entropy risk,

\[ \int R(\hat{p}; p) \Pr(p)\,dp, \]

is minimized by setting $\hat{p} = \int p \Pr(p)\,dp$. The posterior
mean is Bayes for a large and important class of risk
functions called Bregman divergences, the most prominent of
which is Kullback-Leibler divergence. We will make
extensive use of this convenient property throughout this
paper, but in a broader context it is important to
remember that for many other risk functions, the Bayes
estimator is not the posterior mean.
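The claim that the posterior mean minimizes expected entropy risk can be checked numerically. The sketch below is illustrative only: it assumes a small discrete distribution over $p$ (the support points and weights are made up for the example) and searches a grid of candidate estimates.

```python
from math import log


def entropy_risk(p_hat, p):
    """R(p_hat; p) = sum_k p_k (log p_k - log p_hat_k) for a two-outcome coin."""
    total = 0.0
    for truth, est in ((p, p_hat), (1.0 - p, 1.0 - p_hat)):
        if truth > 0.0:
            total += truth * (log(truth) - log(est))
    return total


def expected_risk(p_hat, support, weights):
    """Average entropy risk of a fixed estimate over a discrete distribution of p."""
    return sum(w * entropy_risk(p_hat, p) for p, w in zip(support, weights))


# Hypothetical discrete distribution over p, standing in for a posterior.
support = [0.1, 0.3, 0.8]
weights = [0.5, 0.3, 0.2]
mean = sum(w * p for p, w in zip(support, weights))

# Brute-force search over candidate estimates strictly inside (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
best = min(grid, key=lambda x: expected_risk(x, support, weights))
```

Up to the grid resolution, `best` coincides with `mean`. The reason is that the expected risk depends on the estimate only through $-\bar{p}\log\hat{p} - (1-\bar{p})\log(1-\hat{p})$, where $\bar{p}$ is the mean.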

For the Lebesgue prior, on the other hand, there is
little justification. Lebesgue measure is defined on the reals
by invoking translational symmetry, which does not
exist for the probability simplex. In fact, convenience is
the best argument - it certainly makes calculation easy!
But this argument can be extended to a large class of
conjugate priors. Updating a Lebesgue prior in light of
binomial (Bernoulli) data always yields a Beta
distribution [7] as the posterior,

\[ \Pr(p)\,dp \propto p^{\beta-1} (1-p)^{\gamma-1}\,dp. \]

The family of Beta distributions is closed under the
operation "add another observation", which multiplies $\mathcal{L}(p)$
by $p$ or $(1 - p)$. So the same posterior form is obtained
whenever the prior is a Beta distribution. The conjugate
priors for binomial data are therefore Beta distributions.

Any Beta prior is a convenient choice, but only those
that maintain symmetry between heads and tails can be
considered "noninformative" priors, e.g.

\[ P_0(p) \propto p^{\beta-1} (1-p)^{\beta-1}, \]

for any real $\beta > 0$. The Bayes estimator for such a prior
is (via Bayes' Rule and the posterior mean),

\[ \hat{p}_\beta = \frac{n + \beta}{N + 2\beta}, \]

a rule known (for obvious reasons) as "add $\beta$", or (more
obscurely) as Lidstone's Law [12].

The "add $\beta$" estimators (including Laplace's Law) are
convenient and sensible generalizations of the linear
inversion estimator. As we shall see later, they also
generalize the ML estimator in an elegant way. Better yet,
they are Bayes estimators for Beta priors (with respect to
entropy risk). However, there are infinitely many other
priors, with their own unique Bayes estimators.
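The "add $\beta$" rule and its Bayesian origin can be sketched in a few lines. The numerical integration below is a crude midpoint rule written for illustration, not a recommended method, and the function names are assumptions.

```python
def add_beta(n, N, beta):
    """'Add beta' estimate: beta fictitious observations of each outcome."""
    return (n + beta) / (N + 2 * beta)


def posterior_mean_numeric(n, N, beta, steps=20000):
    """Posterior mean under the prior p^(beta-1) (1-p)^(beta-1), midpoint rule."""
    num = 0.0
    den = 0.0
    for i in range(steps):
        p = (i + 0.5) / steps
        weight = p ** (n + beta - 1) * (1 - p) ** (N - n + beta - 1)
        num += p * weight
        den += weight
    return num / den


laplace = add_beta(3, 10, 1.0)    # "add 1" (flat prior)
jeffreys = add_beta(3, 10, 0.5)   # "add 1/2"
check = posterior_mean_numeric(3, 10, 0.5)
```

Setting `beta = 0` recovers the ML (linear inversion) estimate $n/N$, while `beta = 1` gives Laplace's Law.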

B. Minimax and the noiseless coin 

For a true Bayesian, the previous section's analysis 
stands alone. Every scientist or statistician must look 
into his or her heart, find the subjective prior therein, 
and implement its Bayes estimator. For less committed
Bayesians, the minimax criterion [11] can be used to
select a unique optimal estimator. The essence of minimax
reasoning is to compare different estimators not by
their average performance (over p), but by their worst- 
case performance. This eliminates the need to choose a 
measure (prior) over which to average. Since this is a 
fundamentally frequentist notion, it is all the more re- 
markable that the minimax estimator is - for a broad 
range of problems - actually the Bayes estimator for a 
particular prior! 

This theorem is known as minimax-Bayes duality [11].
The minimax estimator is, by definition, the estimator
$\hat{p}(\cdot)$ that minimizes the maximum risk (thus the name):

\[ R_{\max}[\hat{p}(\cdot)] = \max_p \sum_n \Pr(n|p)\, R(\hat{p}(n); p). \]

Minimax-Bayes duality states that, as long as certain
convexity conditions are satisfied (as they are for entropy
risk), the minimax estimator is the Bayes estimator for
a least favorable prior (LFP), $P_{\mathrm{worst}}(p)dp$.

This duality has some useful corollaries. The Bayes
risk of the LFP (the minimum achievable average risk,
which is achieved by the LFP's Bayes estimator) is equal
to the minimax risk. So the pointwise risk $R(\hat{p}_{\mathrm{Bayes}}(n); p)$
is identically equal to $R_{\max}$ at every support point of the
LFP. Furthermore, the LFP has the highest Bayes risk
of all priors. This means that the Bayes estimator of any
prior $P_0(p)dp$ yields both upper and lower bounds on the
minimax risk,

\[ R_{\mathrm{Bayes}}(P_0) \leq R_{\mathrm{minimax}} \leq R_{\max}(P_0). \tag{2} \]

These bounds coincide only for least favorable priors and
minimax estimators.

Quite a lot is known about minimax estimators for the
noiseless coin. For small $N$, minimax estimators have
been found numerically using the "Box simplex method"
[14][15]. For large $N$, the binomial distribution of $n$ is well
approximated by a Poisson distribution with parameter
$\lambda = Np$. By applying this approximation, and
lower-bounding the maximum risk by the Bayes risk of the
"add 1" estimator, it has been shown [13] that

\[ R_{\max}[\hat{p}(\cdot)] \geq \frac{1}{2N} + O\!\left(\frac{1}{N^2}\right). \]



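This limit can be achieved via the peculiar but simple
estimator [3][4]

\[ \hat{p}(n) = \begin{cases}
  \dfrac{n + 1/2}{N + 5/4} & \text{if } n = 0, \\[6pt]
  \dfrac{n + 1}{N + 7/4}   & \text{if } n = 1, \\[6pt]
  \dfrac{n + 3/4}{N + 7/4} & \text{if } n = N - 1, \\[6pt]
  \dfrac{n + 3/4}{N + 5/4} & \text{if } n = N, \\[6pt]
  \dfrac{n + 3/4}{N + 3/2} & \text{otherwise.}
\end{cases} \]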
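To make the piecewise rule concrete, the sketch below implements it and evaluates its expected entropy risk for a noiseless coin. The function names are illustrative, $N = 100$ is an assumed example value, and the code assumes $N \geq 3$ so that the special cases $n = 1$ and $n = N - 1$ do not collide.

```python
from math import comb, log


def piecewise_estimate(n, N):
    """Piecewise rule quoted above; assumes N >= 3 so the cases are distinct."""
    if n == 0:
        return (n + 0.5) / (N + 1.25)
    if n == 1:
        return (n + 1.0) / (N + 1.75)
    if n == N - 1:
        return (n + 0.75) / (N + 1.75)
    if n == N:
        return (n + 0.75) / (N + 1.25)
    return (n + 0.75) / (N + 1.5)


def entropy_risk(p_hat, p):
    """R(p_hat; p) = sum_k p_k (log p_k - log p_hat_k) for a two-outcome coin."""
    total = 0.0
    for truth, est in ((p, p_hat), (1.0 - p, 1.0 - p_hat)):
        if truth > 0.0:
            total += truth * (log(truth) - log(est))
    return total


def expected_risk(estimator, N, p):
    """Average of the entropy risk over n ~ Binomial(N, p) (noiseless coin)."""
    return sum(comb(N, n) * p**n * (1.0 - p)**(N - n)
               * entropy_risk(estimator(n, N), p) for n in range(N + 1))


N = 100
# N times the risk on a coarse grid of true p values; compare with 1/2.
scaled = [N * expected_risk(piecewise_estimate, N, i / 20) for i in range(21)]
```

The rule is symmetric under exchanging heads and tails, i.e. $\hat{p}(N - n) = 1 - \hat{p}(n)$, and the scaled risk stays in the neighborhood of $1/2$ across the grid.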



This estimator is very nearly "add 3/4", and in fact the
"add $\beta$" estimators are almost minimax. "Add 1/2",
which corresponds to the celebrated Jeffreys' Prior [8], is
a de facto standard. However, it has been shown [10][13]
that the very best "add $\beta$" estimator (although not quite
minimax) is the one with $\beta = \beta_0 = 0.509\ldots$, whose
asymptotic risk is

\[ R_{\max}[\hat{p}_{\beta_0}(\cdot)] = \beta_0 \frac{1}{N} + O\!\left(\frac{1}{N^2}\right). \]

These estimators hedge effectively against unobserved
events (in contrast to ML, which assigns $\hat{p} = 0$ if $n = 0$),
and we will strive to generalize them for the noisy coin.



IV. THE NOISY COIN 

Adding noise - random bit flips with probability $\alpha$ -
to the data separates the effective probability of "heads",

\[ q = \alpha + p(1 - 2\alpha), \tag{3} \]

from the true probability $p$. Quite a lot of the
complications that ensue can be understood as stemming from
a single underlying schizophrenia in the problem: there
are now two relevant probability simplices, one for $q$ and
one for $p$ (see Figure 1).

One part of the problem (the data, and therefore the
likelihood function) essentially lives on the $q$-simplex. The
other parts (the parameter to be estimated, and therefore
the risk function) live on the $p$-simplex. The simplest,
most obvious estimator is linear inversion,

\[ \hat{q} = \frac{n}{N}, \]

which implies (by inversion of Eq. 3),

\[ \hat{p}_{\mathrm{LI}} = \frac{\hat{q} - \alpha}{1 - 2\alpha}. \tag{4} \]




FIG. 1: Most of the complications in noisy coin estimation
come from the existence of - and linear relationship between -
two simplices, one containing all the true probability
distributions $\{p_k\}$, and the other containing all effective
distributions $\{q_k\}$. Above (top), this is illustrated for a 3-sided die
(not considered in this paper). Below, an annotated diagram
of sampling mismatch for a coin shows how standard errors
for the linear inversion estimator can easily extend outside
the $p$-simplex.



But if $n < \alpha N$ or $n > (1 - \alpha)N$, then $\hat{p}_{\mathrm{LI}}$ is negative or
greater than 1. This is patently absurd. It is also clearly
suboptimal, as there is no advantage to assigning an
estimate that lies outside the (convex) set of valid states.
Finally, it guarantees infinite expected risk - which is to
say that it is quantitatively very suboptimal.

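Maximum likelihood does not improve matters much.
A bit of algebra shows that, just as for the noiseless coin,
ML coincides with linear inversion - if $\hat{p}_{\mathrm{LI}}$ is a valid
probability. Otherwise, $\mathcal{L}(p)$ achieves its maximum value
on the boundary of its domain, at 0 or 1:

\[ \hat{p}_{\mathrm{ML}} = \begin{cases}
  0 & \text{if } n < \alpha N, \\[4pt]
  \dfrac{n - \alpha N}{N(1 - 2\alpha)} & \text{if } \alpha N \leq n \leq N(1 - \alpha), \\[8pt]
  1 & \text{if } n > N(1 - \alpha).
\end{cases} \]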
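The piecewise ML estimator above is simply linear inversion clipped to $[0, 1]$. The sketch below implements it and also tallies how often the estimate lands exactly on the $\hat{p} = 0$ boundary. The function names and the example values ($N = 100$, $\alpha = 0.1$, true $p = 0$) are assumptions chosen for illustration.

```python
from math import comb


def ml_noisy(n, N, alpha):
    """ML estimate of p for a noisy coin: linear inversion clipped to [0, 1]."""
    p_li = (n - alpha * N) / (N * (1 - 2 * alpha))
    return min(1.0, max(0.0, p_li))


def prob_ml_is_zero(N, p, alpha):
    """Probability that the ML estimate sits on the p_hat = 0 boundary."""
    q = alpha + p * (1 - 2 * alpha)
    return sum(comb(N, n) * q**n * (1 - q)**(N - n)
               for n in range(N + 1) if ml_noisy(n, N, alpha) == 0.0)


# Assumed example: true p = 0, so the recorded heads are pure noise.
boundary_prob = prob_ml_is_zero(100, 0.0, 0.1)
```

In this example roughly half of all data sets produce $\hat{p}_{\mathrm{ML}} = 0$, so the boundary estimate is a typical outcome rather than a rare one.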



The ML estimator is always a valid probability (by
construction, since the domain of $\mathcal{L}(p)$ is the simplex).
However, like linear inversion, it is still clearly suboptimal. It
is never risk-minimizing to assign $\hat{p} = 0$ unless we are
certain that $p$ truly is zero. Moreover, the expected
entropy risk is still infinite under all circumstances, since
$\hat{p} = 0$ occurs with nonzero probability and $R(0; p) = \infty$.

For the noiseless coin the only way that $\hat{p}_{\mathrm{ML}} = 0$ is
if $n = 0$ is observed. For the noisy coin, $\hat{p}_{\mathrm{ML}} = 0$ for a
whole range of data - and with probability close to 50%
when $p$ is close to 0. In the noiseless context, adding fictitious
observations solved this problem quite well, generating
"add $\beta$" estimators with near-optimal performance.
Unfortunately, the addition of fictitious observations

\[ n_k \rightarrow n_k + \beta \]

doesn't have the same effect for the noisy coin. If we have
$n < \alpha N - \beta$ (which is possible and even probable), then
adding $\beta$ has no effect. Linear inversion still gives $\hat{p} < 0$,
and ML still gives $\hat{p} = 0$.

The core problem here is sampling mismatch. We
sample from $q$, but the risk is determined by $p$. ML takes no
account of the risk function, and neither does our attempt
to hedge the estimate by adding fake samples. Both are
entirely $q$-centric. A mechanism that acts directly on the
$p$-simplex is needed, to force $\hat{p}$ away from the dangerous
boundaries ($p = 0$ and $p = 1$, where the risk diverges).

One simple way to do this is to modify $\mathcal{L}(p)$,
multiplying it by a "hedging function" [2],

\[ h(p) = \prod_k p_k^\beta = p^\beta (1-p)^\beta, \]

so that

\[ \mathcal{L}(p) \rightarrow \mathcal{L}'(p) = p^\beta (1-p)^\beta \mathcal{L}(p), \]

and define $\hat{p}_\beta = \operatorname{argmax}_p \mathcal{L}'(p)$ (see Fig. 2). For a noiseless
coin, this is identical to adding $\beta$ fictitious observations
of each possible event - but for $\alpha > 0$, they are not
equivalent. The hedging function modification is sensitive
to the $p_k = 0$ boundary of the simplex, and inexorably
forces the maximum of $\mathcal{L}'(p)$ away from it (since
$\mathcal{L}'(p)$ remains log-concave, but equals zero at the
boundary).

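The HML estimator is given by

\[ \hat{p}_\beta = \frac{\hat{q}_\beta - \alpha}{1 - 2\alpha}, \]

where $\hat{q}_\beta$ is the zero of the cubic polynomial

\[ (N + 2\beta) q^3 - (N + n + 3\beta) q^2 + (n + \beta + N\alpha - N\alpha^2) q + n\alpha^2 - n\alpha \]

that lies in $[\alpha, 1 - \alpha]$.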
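One way to compute the HML estimate without a polynomial root finder is to bisect on the derivative of the log hedged likelihood, which is strictly decreasing on $(\alpha, 1 - \alpha)$ because the hedged likelihood is log-concave. The sketch below takes that route as an illustrative choice (it is not claimed to be the authors' implementation) and then checks the result against the cubic above. The names `hml_estimate` and `cubic` are assumptions, and $\beta > 0$ is required.

```python
def hml_estimate(n, N, alpha, beta, iters=200):
    """Hedged ML: maximize p^beta (1-p)^beta q^n (1-q)^(N-n); returns (p, q)."""
    def score(q):
        # Derivative of the log hedged likelihood with respect to q.
        s = beta / (q - alpha) - beta / (1.0 - alpha - q)
        if n > 0:
            s += n / q
        if n < N:
            s -= (N - n) / (1.0 - q)
        return s

    # score -> +infinity at q = alpha and -infinity at q = 1 - alpha (beta > 0).
    lo, hi = alpha, 1.0 - alpha
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if not (lo < mid < hi):   # interval has shrunk to adjacent floats
            break
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    q_beta = 0.5 * (lo + hi)
    return (q_beta - alpha) / (1.0 - 2.0 * alpha), q_beta


def cubic(q, n, N, alpha, beta):
    """The cubic polynomial whose root in [alpha, 1 - alpha] defines q_beta."""
    return ((N + 2 * beta) * q**3 - (N + n + 3 * beta) * q**2
            + (n + beta + N * alpha - N * alpha**2) * q
            + n * alpha**2 - n * alpha)


# Assumed example: n = 5 heads out of N = 100 with alpha = 0.1, where ML gives 0.
p_hedged, q_hedged = hml_estimate(5, 100, 0.1, 0.05)
```

For $\alpha = 0$ the same routine reproduces the "add $\beta$" rule $(n + \beta)/(N + 2\beta)$, and for the noisy example it returns a small but strictly positive estimate.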

Figure 3 illustrates how $\hat{p}_\beta$ and $\hat{p}_{\mathrm{ML}}$ depend on $n$.
When $\hat{p}_{\mathrm{ML}}$ is far from the simplex boundary (0 and 1),
hedging has relatively little effect. In fact, hedging yields
an approximately linear estimator akin to "add $\beta$". But
as $\hat{p}_{\mathrm{ML}}$ approaches 0 or 1, the effect of hedging increases.
When $\hat{p}_{\mathrm{ML}}$ intersects 0, at $n = \alpha N$, $\hat{p}_\beta = O(1/\sqrt{N})$.
This fairly dramatic shift occurs because the likelihood
function is approximately Gaussian, with a maximum at
$p = 0$ and a width of $O(1/\sqrt{N})$. $\mathcal{L}(p)$ declines rather
slowly from $p = 0$, and $p = O(1/\sqrt{N})$ is not
substantially less likely than $p = 0$, so the hedging imperative to
avoid $p = 0$ pushes the maximum of $\mathcal{L}'(p)$ far inside the
simplex.

FIG. 2: Illustration of a hedging function ($\beta = 0.1$), and
its effect on the likelihood function for noiseless (top) and
noisy (bottom) coins. In the top plot, we show the hedging
function over the 2-simplex (dotted black line), the likelihood
function for an extreme data set comprising 10 heads and 0
tails (red line), and the corresponding hedged likelihood (blue
line). The lower plot shows the same functions for a noisy
coin with $\alpha = 0.1$. The shaded regions are outside the
$p$-simplex (and therefore forbidden), but correspond to valid $q$
values. Note that the unconstrained maximum of $\mathcal{L}$ lies in the
forbidden region where $p > 1$, and therefore the maximum of
the constrained likelihood is on the boundary ($p = 1$) - a
pathology that hedging remedies.

How accurate are these hedged estimators? Figure 4
shows the pointwise average risk as a function of the true
$p$, for different amounts of noise ($\alpha$) and hedging ($\beta$). For
the noiseless ($\alpha = 0$) coin, $\beta = 1/2$ yields a nearly flat
risk profile given by $R(p) \approx 1/2N$. In contrast, hedged
estimators for the noisy coin yield similar profiles that
rise from $O(1/N)$ in the interior to a peak of $O(1/\sqrt{N})$
around $p = O(1/\sqrt{N})$. Risk at the $p = 0$ boundary
depends on $\beta$, and may be either higher or lower than
the peak at $p = O(1/\sqrt{N})$.

FIG. 3: HML (hedged maximum likelihood) estimators $\hat{p}_\beta(n)$
are shown for several values of $\beta$, and compared with the
maximum likelihood (ML) estimator $\hat{p}_{\mathrm{ML}}(n)$. $N = 100$ in
all cases. Whereas the ML estimator is linear in $n$ until it
encounters $p = 0$, the hedged estimator smoothly approaches
$p = 0$ as the data become more extreme. Increasing $\beta$ pushes
$\hat{p}$ away from $\hat{p} = 0$.

At first glance, this behavior suggests a serious flaw in
the hedged estimators. The peak around $p \approx O(1/\sqrt{N})$
is of particular concern, since in all cases the risk is
$O(1/\sqrt{N})$ there. But in fact, this behavior is generic
for the noisy coin. Minimax estimators have similar
$O(1/\sqrt{N})$ errors, and hedged estimators turn out to
perform quite well. However, they are not minimax, or even
close to it! As we shall show in the next section, the
noisy coin's "intrinsic risk" profile is far from flat. The
minimax estimator attempts to flatten it - at substantial
cost.
V. MINIMAX ESTIMATORS FOR THE NOISY 
COIN 

Some simple estimators (such as "add 1/2") are nearly 
minimax for the noiseless coin. This is not true for the 
noisy coin in general, because (as we shall see) the min- 
imax estimators are somewhat pathological. So we used 
numerics to find good approximations to minimax esti- 
mators for noisy coins. 

Minimax-Bayes duality permits us to search over
priors rather than estimators. Each prior $\pi(p)dp$ defines
a Bayesian mean estimator $\hat{p}_\pi(n)$, which is Bayes for
$\pi(p)$. Its risk profile $R_{\hat{p}_\pi}(p)$ provides both upper and
lower bounds on the minimax risk (Eq. 2).

As is often the case for discretely distributed data, the
minimax priors for coins appear to always be discrete
[11]. We searched for least favorable priors (holding $N$
and $\alpha$ fixed) using the algorithm of Kempthorne [9]. We
defined a prior with a few support points, and let the
location and weight of the support points vary in order
to maximize the Bayes risk. Once the optimization
equilibrated, we added new support points at local maxima
of the risk, and repeated this process until the algorithm
found priors for which the maximum and Bayes risk
coincided to within $10^{-6}$ relative error.
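The core evaluation step inside such a search can be sketched as follows: for a fixed discrete prior, build the posterior-mean estimator, then compute its Bayes risk and (approximately) its maximum risk, which bracket the minimax risk as in Eq. 2. This is only the inner evaluation, not Kempthorne's algorithm. The prior below is an arbitrary assumed example rather than a least favorable prior, all function names are illustrative, and the maximum is taken over a finite grid, so it slightly underestimates the true maximum.

```python
from math import comb, log


def pmf(n, N, p, alpha):
    """Pr(n | p) for the noisy coin."""
    q = alpha + p * (1 - 2 * alpha)
    return comb(N, n) * q**n * (1 - q)**(N - n)


def entropy_risk(p_hat, p):
    """R(p_hat; p) = sum_k p_k (log p_k - log p_hat_k)."""
    total = 0.0
    for truth, est in ((p, p_hat), (1.0 - p, 1.0 - p_hat)):
        if truth > 0.0:
            total += truth * (log(truth) - log(est))
    return total


def bayes_estimator(support, weights, N, alpha):
    """Posterior-mean estimate p_hat(n), n = 0..N, for a discrete prior."""
    table = []
    for n in range(N + 1):
        like = [w * pmf(n, N, p, alpha) for p, w in zip(support, weights)]
        table.append(sum(p * l for p, l in zip(support, like)) / sum(like))
    return table


def risk(table, N, p, alpha):
    """Pointwise expected risk of the tabulated estimator at true value p."""
    return sum(pmf(n, N, p, alpha) * entropy_risk(table[n], p)
               for n in range(N + 1))


# Assumed example prior (NOT least favorable): five equally weighted points.
N, alpha = 20, 0.1
support = [0.0, 0.1, 0.5, 0.9, 1.0]
weights = [0.2] * 5
table = bayes_estimator(support, weights, N, alpha)
bayes_risk = sum(w * risk(table, N, p, alpha) for p, w in zip(support, weights))
max_risk = max(risk(table, N, i / 200, alpha) for i in range(201))
```

A search for the least favorable prior would adjust `support` and `weights` to push `bayes_risk` up until it meets `max_risk`.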

FIG. 4: The risk profile $R(p)$ is shown for several HML
estimators. $N = 100$ in all cases. In the top plot, a noisy coin
with $\alpha = 1/4$ has been estimated using three different HML
estimators. The optimal $\beta$ balances boundary risk against
interior risk. Increasing $\beta$ increases boundary risk, while
decreasing it increases interior risk. The bottom plot shows the
risk of optimal HML estimators for $\alpha = 0, 1/100, 1/10, 1/4$.
Risk approaches $O(1/\sqrt{N})$ for noisy coins, vs. $O(1/N)$ for
noiseless coins.

Figure 5 illustrates minimax estimators for several $N$
and $\alpha$, while Figure 6 shows the resulting risk profiles.
The minimax risk is $O(1/\sqrt{N})$ - not $O(1/N)$ as for the
noiseless coin. The risk is clearly dominated by points
near the boundary, and we find that the LFPs typically
place almost all their weight on support points within a
distance $O(1/\sqrt{N})$ of the boundary ($p = 0$ and $p = 1$).
As a result of this severe weighting toward the
boundary, the minimax estimators are highly biased toward
$p \approx 1/\sqrt{N}$ - not just when $p$ is close to the boundary
(when bias is inevitable) but also when $p$ is in the
interior! This effect is truly pathological, although it can
easily be explained. Low risk, of order $1/N$, can easily
be achieved in the interior. However, the minimax
estimator seeks at all costs to reduce the maximum risk,
which is achieved near $p \approx 1/\sqrt{N}$. By biasing heavily
toward $p \approx 1/\sqrt{N}$, the estimator achieves slightly lower
maximum risk... at the cost of dramatically increasing
its interior risk from $O(1/N)$ to $O(1/\sqrt{N})$.

FIG. 5: Minimax and ML estimators are shown for $N = 100$
and $\alpha = 1/10, 1/4$. Note that the minimax estimator is
grossly biased in the interior - a pathological result of the
mandate to minimize maximum risk at all costs.

The preceding analysis made use of an intuitive notion
of pointwise "intrinsic risk" - i.e., a lower bound $R_{\min}(p)$
on the expected risk for any given $p$. Formally, no such
lower bound exists. We can achieve $R(p') = 0$ for any
$p'$, simply by using the estimator $\hat{p} = p'$. But we can
rigorously define something very similar, which we call
bimodal risk.

The reason that it's not practical to achieve $R(p') = 0$
at any given $p'$ is, of course, that $p'$ is unknown. We must
take into account the possibility that $p$ takes some other
value. Least favorable priors are intended to quantify the
risk that ensues, but a LFP is a property of the entire
problem, not of any particular $p'$. In order to quantify
"how hard is a particular $p'$ to estimate," we consider the
set of bimodal priors,

\[ \pi_{w,p',p''}(p) = w\,\delta(p - p') + (1 - w)\,\delta(p - p''), \]

and maximize Bayes risk over them. We define the
bimodal risk of $p'$ as

\[ R_2(p') = \max_{w,\,p''} R_{\mathrm{Bayes}}(\pi_{w,p',p''}). \]

The bimodal risk quantifies the difficulty of
distinguishing $p'$ from just one other state $p''$. As such, it is always
a lower bound on the minimax risk.

Figure 7 compares the bimodal risk to the pointwise
risk achieved by the minimax and [optimal] HML
estimators. Note that the bimodal risk function is a strict
lower bound (at every point) for the minimax risk, but
not for the pointwise risk of any estimator (including the
minimax estimator). However, every estimator exceeds
the bimodal risk at one or more points, and almost
certainly at many points. Figure 7 confirms that the noisy
coin's risk is dominated by the difficulty of distinguishing
$p \approx 1/\sqrt{N}$ from $p = 0$. States deep inside the simplex are
far easier to estimate, with an expected risk of $O(1/N)$.

FIG. 6: Above (top), the risk profile $R(p)$ is shown for the
minimax estimators of Figure 5 ($N = 100$; $\alpha = 1/10, 1/4$) and
for a noiseless coin (also $N = 100$). No estimator can achieve
lower risk across the board - but the minimax estimator's
risk is very high in the interior ($p > 1/\sqrt{N}$) compared with
HML estimators. The $O(1/\sqrt{N})$ risk of HML estimators is
intrinsic to noisy coins. Below, we show the least favorable
priors whose Bayes estimators have the risk profiles above.
The support points of the discrete priors occur, as they must,
where the risk achieves its maximum value. Thus, the Bayes
risk is equal to maximum risk (within a relative tolerance of
$10^{-6}$).

FIG. 7: The risk profiles of the optimal HML estimator (red)
and the minimax estimator (blue) are compared with the
bimodal lower bound $R_2(p)$. Note that while the HML
maximum risk exceeds the minimax risk (as it must!), it is
competitive - and HML is far more accurate in the interior. The
bimodal lower bound supports the conjecture that HML is a
good compromise, since the HML risk exceeds the bimodal
bound by a nearly constant factor.

A simple analytic explanation for this behavior can
be obtained by series expansion of the Kullback-Leibler
divergence,

\[ \mathrm{KL}(p + \epsilon; p) \approx \frac{\epsilon^2}{2p(1-p)}. \]

The typical error, $\epsilon$, in the estimator is $O(1/\sqrt{N})$, so as
long as both $p$ and $1 - p$ are bounded away from zero,
the typical entropy risk is $O(1/N)$. But as $p$ approaches
0 (or 1), this approximation diverges. If we consider
$p = 1/\sqrt{N}$, then we get

\[ \mathrm{KL}(p + \epsilon; p) \approx \frac{\epsilon^2}{2p} = O(1/\sqrt{N}). \]

At $p = 0$, the series expansion fails, but a one-sided
approximation (with $\epsilon$ strictly greater than 0) gives the
same result,

\[ \mathrm{KL}(\epsilon; 0) \approx \epsilon = O(1/\sqrt{N}). \]

For the noiseless coin, this does not occur because the
typical error is not always $O(1/\sqrt{N})$. Instead, it scales
with the inverse of the Fisher information,

\[ \epsilon \propto \sqrt{\frac{p(1-p)}{N}}, \]

and the $p$-dependent factor neatly cancels its counterpart
in the Kullback-Leibler divergence, which produces the
nice flat risk profile seen for the noiseless coin. The
underlying problem for the noisy coin is - again - sampling
mismatch. Because we are observing $q$ and predicting $p$,
the two factors do not cancel out.
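For readers who want the intermediate steps, the two approximations used in this argument can be written out explicitly. The derivation below is a standard Taylor expansion, stated under the convention (read off from the formulas in this paper) that the first argument of KL is the estimate and the second is the true value, and assuming $p$ is held fixed in the interior for the first expansion.

```latex
\begin{align*}
\mathrm{KL}(p+\epsilon;\,p)
  &= p\log\frac{p}{p+\epsilon} + (1-p)\log\frac{1-p}{1-p-\epsilon} \\
  &= -p\log\!\left(1+\frac{\epsilon}{p}\right)
     - (1-p)\log\!\left(1-\frac{\epsilon}{1-p}\right) \\
  &= \frac{\epsilon^2}{2p} + \frac{\epsilon^2}{2(1-p)} + O(\epsilon^3)
   = \frac{\epsilon^2}{2p(1-p)} + O(\epsilon^3), \\[4pt]
\mathrm{KL}(\epsilon;\,0)
  &= -\log(1-\epsilon) = \epsilon + \frac{\epsilon^2}{2} + O(\epsilon^3).
\end{align*}
```

The linear terms cancel in the interior, leaving a quadratic dependence on the error, whereas at the boundary the leading term is linear in $\epsilon$. With $\epsilon \sim 1/\sqrt{N}$, this is the difference between $O(1/N)$ and $O(1/\sqrt{N})$ risk.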



VI. GOOD ESTIMATORS FOR THE NOISY 
COIN 

Minimax is an elegant concept, but for the noisy coin 
it does not yield "good" estimators. In a single-minded 
quest to minimize the maximum risk, it yields wildly bi- 
ased estimates in the interior of the simplex. This is 
reasonable only in the [implausible] case where p is truly 
selected by an adversary (in which case the adversary 
would almost always choose $p$ near the boundary, out of
sheer bloody-mindedness). In the real world, robustness
against adversarial selection of p is good, but should not 
be taken to absurd limits. 

FIG. 8: Here, we show how three aspects of optimal HML
estimation behave as $N$ increases toward infinity. They are:
(1) the risk deep in the bulk (blue); (2) the bimodal risk at
the boundary (red); and (3) the optimal value of $\beta$ (green).
The data shown here, $N = 10 \ldots 10^5$, correspond to a noise
level of $\alpha = 1/100$, but other values of $\alpha$ produce qualitatively
identical results. The "optimal" $\beta$ is the one that achieves the
smallest maximum risk, by balancing the risk at $p = 0$ against
its maximum value in the interior of the simplex. Its value
decreases with $N$ for small $N$, and for $N \gg \alpha^{-1}$ it asymptotes
to $\beta_{\mathrm{optimal}} \approx 0.0389$ (see also Fig. 9). The HML estimator's
expected risk varies quite dramatically with $p$ (as expected).
At $p = 0$, the risk asymptotes to $R(0) \approx 1/(4\sqrt{N})$. At $p = 1/2$,
it asymptotes to $R(1/2) \approx 1/N$. The $O(1/\sqrt{N})$ expected risk
is seen only for $p \in [0, O(1)/\sqrt{N}]$.

This leaves us in need of a quantitative criterion for
"good" estimators. Ideally, we would like an estimator
that achieves (or comes close to achieving) the "intrinsic"
risk for every $p$. The bimodal risk $R_2(p)$ provides a
reasonably good proxy - or, more precisely, a lower bound
- for intrinsic risk (in the absence of a rigorous
definition). This is not a precise quantitative framework, but
it does provide a reasonably straightforward criterion: we
are looking for an estimator that closely approaches the
bimodal risk profile.

FIG. 9: The optimal value of $\beta$ for $N = 2 \ldots 2^{17}$ and 
$\alpha = 2^{-12} \ldots 2^{-2}$. It approaches the optimal noiseless value $\approx 1/2$ 
when $N \ll \alpha^{-1}$. It declines rapidly, and at roughly $N \approx \alpha^{-1}$ 
it is well within $10^{-2}$ of what appears to be its asymptotic 
value $\beta_{\rm optimal} \approx 0.0389$. 



Hedged estimators are a natural ansatz, but we need 
to specify $\beta$. Whereas the noiseless coin is fairly accurately 
estimated by $\beta = 1/2$ for all N, the optimal value 
of $\beta$ varies with N for noisy coins. Local maxima of the 
risk are located at $p = 0$ and at $p \approx 1/\sqrt{N}$, one or both 
of which is always the global maximum. So, to choose 
$\beta$, we minimize the maximum risk by setting these two local 
maxima equal to each other. Figure 8 shows the optimal $\beta$ as a 
function of N, for a representative value of $\alpha$; it approaches 
$\beta_{\rm optimal} \approx 0.0389$ for large N. This value is obtained 
for a large range of $\alpha$'s, as shown in Figure 9. Finally, 
Figure 7 compares the risk profile of the optimal hedging 
estimator with (i) bimodal risk, and (ii) the risk of 
the minimax estimator. We conclude that while optimal 
hedging estimators probably do not offer strictly optimal 
performance, they are (i) easy to specify and calculate, 
(ii) far better than minimax estimators for almost all 
values of p, and (iii) relatively close to the lower bound 
defined by bimodal risk. 
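
The balancing of the two local maxima can be explored numerically. The following Python sketch is illustrative only and rests on assumptions not spelled out in this section: the noise model $q = \alpha + (1 - 2\alpha)p$ for the probability of observing heads, a hedged likelihood $q^n (1-q)^{N-n} [p(1-p)]^\beta$, and risk measured as the expected relative entropy $\mathrm{KL}(p \| \hat{p})$. The function names are hypothetical, and the code is not claimed to reproduce the figures.

```python
import math


def hml_estimate(n, N, alpha, beta):
    """Hedged ML estimate of p from n heads in N noisy flips (requires beta > 0).

    Assumed model: heads appear with probability q = alpha + (1 - 2*alpha)*p,
    and the estimate maximizes q^n (1-q)^(N-n) [p(1-p)]^beta over p in (0, 1).
    """
    s = 1.0 - 2.0 * alpha

    def dlog(p):
        # Derivative of the hedged log-likelihood; strictly decreasing in p.
        q = alpha + s * p
        return s * (n / q - (N - n) / (1.0 - q)) + beta * (1.0 / p - 1.0 / (1.0 - p))

    lo, hi = 1e-12, 1.0 - 1e-12
    for _ in range(100):  # bisection on the sign of the derivative
        mid = 0.5 * (lo + hi)
        if dlog(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def binom_pmf(n, N, q):
    """Binomial probability of n heads in N flips, computed in log space."""
    if q <= 0.0:
        return 1.0 if n == 0 else 0.0
    if q >= 1.0:
        return 1.0 if n == N else 0.0
    log_pmf = (math.lgamma(N + 1) - math.lgamma(n + 1) - math.lgamma(N - n + 1)
               + n * math.log(q) + (N - n) * math.log(1.0 - q))
    return math.exp(log_pmf)


def kl(p, p_hat):
    """Relative entropy KL(p || p_hat) between two coins, with 0 log 0 = 0."""
    total = 0.0
    for a, b in ((p, p_hat), (1.0 - p, 1.0 - p_hat)):
        if a > 0.0:
            total += a * math.log(a / b)
    return total


def expected_risk(p, N, alpha, beta):
    """Exact expected KL risk of the hedged estimator at true bias p."""
    q = alpha + (1.0 - 2.0 * alpha) * p
    return sum(binom_pmf(n, N, q) * kl(p, hml_estimate(n, N, alpha, beta))
               for n in range(N + 1))


if __name__ == "__main__":
    N, alpha = 400, 0.01
    for beta in (0.5, 0.1, 0.04):
        at_zero = expected_risk(0.0, N, alpha, beta)
        near_edge = expected_risk(1.0 / math.sqrt(N), N, alpha, beta)
        print(beta, at_zero, near_edge)
```

Scanning $\beta$ and comparing the two printed risks gives a crude way to look for the value at which the local maxima balance; a finer search over p near $1/\sqrt{N}$ would be needed to locate the interior maximum properly.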



VII. APPLICATIONS AND DISCUSSION 

Our interest in noisy coins (and dice) stems largely 
from their wide range of applications. Our original motivation 
came from quantum state and process estimation 
(see, e.g., [2], as well as references therein). However, 
noisy coins appear in several other important scientific 
contexts. 



A. Randomized response 

Social scientists often want to know the prevalence of 
an embarrassing or illegal habit - e.g., tax evasion, abortion, 
infidelity - in a population. Direct surveys yield 
negatively biased results, since respondents often do not 
trust guarantees of anonymity. Randomized response [5] 
avoids this problem by asking each respondent to flip a 
coin of known bias $\alpha$ and invert their answer if it comes 
up heads. An adulterer or tax cheat can answer honestly, 
but claim (if confronted) that they were lying in accordance 
with instructions. The scientist is left with the 
problem of inferring the true prevalence (p) from noisy 
data - precisely the problem we have considered here. 
Interval estimation (e.g., confidence intervals) is a good 
alternative, but if the study is intended to yield a point 
estimate, our analysis is directly applicable. 
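
To make the correspondence concrete, here is a minimal Python sketch of a randomized-response survey. It assumes each recorded answer is the true answer inverted with probability $\alpha$, so that "yes" is recorded with probability $q = \alpha + (1 - 2\alpha)p$. The function names are hypothetical, and the estimator shown is plain linear inversion, included only to show how an unphysical estimate can arise.

```python
import random


def randomized_response_survey(p, alpha, N, rng):
    """Simulate N respondents; return the number of recorded 'yes' answers."""
    yes = 0
    for _ in range(N):
        truth = rng.random() < p      # respondent truly has the trait
        flip = rng.random() < alpha   # private coin says "invert your answer"
        yes += truth != flip          # recorded answer is truth XOR flip
    return yes


def linear_inversion(n, N, alpha):
    """Invert q = alpha + (1 - 2*alpha)*p at the observed frequency n/N.

    Requires alpha != 1/2. The result is not constrained to [0, 1].
    """
    return (n / N - alpha) / (1.0 - 2.0 * alpha)


if __name__ == "__main__":
    rng = random.Random(0)
    n = randomized_response_survey(p=0.02, alpha=0.25, N=200, rng=rng)
    print(n, linear_inversion(n, 200, 0.25))

    # 40 'yes' answers out of 200 is below the 50 expected from noise alone,
    # so linear inversion returns a negative prevalence (about -0.1).
    print(linear_inversion(40, 200, 0.25))
```

Whenever the observed frequency n/N falls below $\alpha$, linear inversion leaves the interval [0, 1]; this is the regime in which the choice of point estimator matters most.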



FIG. 10: The state space of a qubit comprises all $2 \times 2$ positive 
semidefinite, trace-1 density matrices, and is usually represented 
as the Bloch ball (shown here in cross-section). Quantum 
state tomography of a single qubit normally proceeds by 
measuring three 2-outcome observables, $\sigma_x$, $\sigma_z$, and $\sigma_y$ (not 
shown). Each can be seen as an independent "coin", except 
that their biases are constrained (by positivity of the density 
matrix) to the Bloch ball. When the density matrix is nearly 
pure (i.e., close to the surface of the Bloch ball), and not 
diagonal in one of the measured bases, the qubit behaves much 
like a noisy coin. As shown, linear inversion can easily yield 
estimates outside the convex set of physical states (analogous 
to the p-simplex). 



B. Particle detection 

Particle detectors are ubiquitous in physics, ranging 
from billion-dollar detectors that count neutrinos and 
search for the Higgs boson, to single-photon photodetectors 
used throughout quantum optics. The number of 
counts is usually a Poisson variable, which differs only 
slightly from the binomial studied here. Background 
counts are an unavoidable issue, and lead to an estimation 
problem essentially identical to the noisy coin. Particle 
physicists have argued extensively (see, e.g., [16], [17], 
and references therein) over how to proceed when the 
observed counts are less than the expected background. 
Our proposed solution (if a point estimator is desired - 
region estimation, as in [18], may be a better choice) is to 
use HML or another estimator with a similar risk profile - 
and to be aware that the existence of background counts 
has a dramatic effect on the expected risk. 
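
As a hedged illustration of why the below-background case arises so readily, suppose (an assumption not spelled out in the text) that the detector records n counts drawn from a Poisson distribution with mean $s + b$, where the background b is known and the signal $s \geq 0$ is to be estimated. The symbols s, b, and $\hat{s}_{\rm naive}$ are introduced here for illustration only.

```latex
\Pr(n \mid s) = \frac{(s+b)^n \, e^{-(s+b)}}{n!}, \qquad
\hat{s}_{\rm naive} = n - b, \qquad
\hat{s}_{\rm naive} < 0 \ \text{whenever}\ n < b .
```

For example, with no signal at all ($s = 0$) and $b = 3$, the probability of observing fewer counts than the background is $\Pr(n < 3) = e^{-3}(1 + 3 + 9/2) \approx 0.42$. This plays the same role as an observed frequency below $\alpha$ for the noisy coin.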



C. Quantum state (and process) estimation 

The central idea of quantum information science [19] 
is to encode information into pure quantum states, then 
process it using unitary quantum gates, and thus to ac- 
complish certain desirable tasks (e.g., fast factoring of 
integers, or secure cryptography). 

One essential step in achieving these goals is the experimental 
characterization of quantum hardware using 
state tomography [21] and process tomography (which is 
mathematically isomorphic to state tomography [20]). 

Quantum tomography is, in principle, closely related 
to probability estimation (see Fig. 10). The catch is that 
quantum states cannot be sampled directly. Characterizing 
a single qubit's state requires measuring three independent 
1-bit observables. Typically, these are the Pauli 
spin operators, $\{\sigma_x, \sigma_y, \sigma_z\}$, which behave like coin flips 
with respective biases $q_x, q_y, q_z$. Furthermore, even if the 
quantum state is pure (i.e., it has zero entropy, and some 
measurement is perfectly predictable), no more than one 
of these "coins" can be deterministic - and in almost all 
cases, all three of $q_x, q_y, q_z$ are different from 0 and 1. So, 
just as in the case of the noisy coin, we have pure states 
(analogous to $p = 0$) that yield somewhat random measurement 
outcomes ($q \neq 0$). The details are somewhat 
different from the noisy coin, but sampling mismatch remains 
the essential ingredient. 
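
The "three coins" picture can be sketched in a few lines of Python. The sketch assumes the standard Bloch-vector parametrization, in which a Pauli measurement along axis i returns +1 with probability $(1 + r_i)/2$. The function names are hypothetical, and the counts in the last example are made up for illustration.

```python
import math


def pauli_biases(r):
    """Probability of the +1 outcome for sigma_x, sigma_y, sigma_z, given Bloch vector r."""
    return tuple((1.0 + ri) / 2.0 for ri in r)


def linear_inversion_bloch(plus_counts, N):
    """Estimate the Bloch vector from the number of +1 outcomes in N shots per axis."""
    return tuple(2.0 * k / N - 1.0 for k in plus_counts)


def norm(r):
    """Euclidean length of a Bloch vector; physical states have norm <= 1."""
    return math.sqrt(sum(ri * ri for ri in r))


# A pure state aligned with z: one deterministic "coin", two fair ones.
print(pauli_biases((0.0, 0.0, 1.0)))   # (0.5, 0.5, 1.0)

# A pure state tilted equally toward all three axes: no coin is deterministic.
c = 1.0 / math.sqrt(3.0)
print(pauli_biases((c, c, c)))

# Hypothetical finite-sample counts (9 of 10 per axis) land outside the Bloch ball.
r_hat = linear_inversion_bloch((9, 9, 9), 10)
print(norm(r_hat) > 1.0)               # True
```

The last example is the qubit analogue of a linear-inversion estimate falling outside the p-simplex.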

The implications of our noisy coin analysis for quantum 
tomography are more qualitative than in the applications 
above. Quantum states are not quite noisy coins, but 
they are like noisy coins. In particular, the worst-case 
risk must scale as $O(1/\sqrt{N})$ rather than as $O(1/N)$, for 
any possible fixed choice of measurements. Finding the 
exact minimax risk will require a dedicated study, but the 
analysis here strongly suggests that HML (first proposed 
for precisely this problem in [2]) will perform well. 

Even more interesting is the implication that adap- 
tive tomography can offer an enormous reduction in risk. 
This occurs because for quantum states - unlike noisy 
coins - the amount of "noise" is under the experimenter's 
control. By measuring in the eigenbasis of the true state 
$\rho$, sampling mismatch can be eliminated! However, this 
requires knowing (or guessing) the eigenbasis. Bagan et 
al pQ observed this, via a somewhat different analysis, 
and also demonstrated that in the $N \to \infty$ limit it is sufficient 
to adapt once. An extension of the analysis presented 
here should be able to determine near-minimax 
adaptive strategies for finite N. 
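
The way a well-chosen measurement removes the "noise" can be written out in Bloch-vector notation, which is an assumption of this sketch rather than notation used in the text. Here $\vec{r}$ is the Bloch vector of the true state and $\hat{m}$ is the unit vector defining the measured observable $\hat{m} \cdot \vec{\sigma}$.

```latex
q_{\hat m} = \Pr(+1 \mid \hat m) = \frac{1 + \hat m \cdot \vec r}{2}, \qquad
\hat m = \frac{\vec r}{|\vec r|} \;\Rightarrow\;
q_{\hat m} = \frac{1 + |\vec r|}{2} \;\to\; 1 \ \text{as}\ |\vec r| \to 1 .
```

For a pure state measured along its own Bloch vector the outcome is deterministic, which is the noiseless-coin limit; any other fixed axis gives $q_{\hat m}$ strictly between 0 and 1.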